WEEK 2: TIDY DATA + BASICS OF GRAPHICS

Tuesday, January 16th

No Class! Check Canvas Announcement for details

  • Introduce “Game Planning”
  • Review text + check-in 2.1 + 2.2
  • Begin PA 2: Using Data Visualization to Find the Penguins

“Game Planning”

What: Game plans are strategic guides that prompt you to map out your coding strategy before you implement it.

How: Your favorite sketch app, paper + pencil, online whiteboard (Excalidraw!)

Why: Dr. Rehnberg and I will collect and save the game plans you submit through your Canvas assignments. We will use them to investigate how effective “game plans” are as a pedagogical approach in statistical computing education.

Informed Consent: Please complete the informed consent for game plans on Canvas by Thursday 1/18 at 11:59pm.

Creating a Graphic

To create a specific type of graphic, we will combine aesthetics and geometric objects.


Let’s try it!

Start with the TX housing data.

Make a plot of median house price over time (including both individual data points and a smoothed trend line), distinguishing between different cities.

Code
ggplot(data = txhousing, 
       mapping = aes(x = date, 
                     y = median, 
                     color = city
                     )
       ) + 
  geom_point() + 
  geom_smooth(method = "loess") + 
  labs(x = "Date",
       y = "Median Home Price",
       title = "Texas Housing Prices")

PA 2: Using Data Visualization to Find the Penguins

Artwork by Allison Horst

To do…

  • “Game Plan” Informed Consent

  • PA 2: Using Data Visualization to Find the Penguins

    • Due Thursday (1/18) at 11:59pm
  • Ugly Graphics of Penguins (up to +9 FP)

    • Due Monday (1/22) at 11:59pm

Note

I am holding my Monday office hours on Tuesday (1/16) from 3:10-4pm.

Thursday, January 18th

Today we will…

  • Let’s talk about “Game Plans”
  • Update on Flex Point Opportunities
  • Review
    • Questions from Reading
    • PA 2: Using Data Visualization to Find the Penguins
  • New Material
    • Tidy Data
    • Load External Data
    • Graphics (and ggplot2)
    • What makes a good graphic?
  • Lab 2: Exploring Rodents with ggplot2
  • Strike: What happens next week?

Tidy Data

Artwork by Allison Horst

Working with External Data

Common Types of Data Files

Look at the file extension for the type of data file.

.csv : “comma-separated values”

Name, Age
Bob, 49
Joe, 40

.xls, .xlsx: Microsoft Excel spreadsheet

  • Common approach: save as .csv
  • Nicer approach: use the readxl package

.txt: plain text

  • Could have any sort of delimiter…
  • Need to let R know what to look for!

Loading External Data

Using base R functions:

  • read.csv() is for reading in .csv files.

  • read.table() and read.delim() are for any data with “columns” (you specify the separator).
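As a minimal sketch of the base R functions above, the example below writes two tiny files to temporary locations and reads them back in. The file contents, the `;` separator, and the object names (`people_csv`, `people_txt`) are made up for illustration.

```r
# Hypothetical two-row data set, written to temporary files for illustration
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Name,Age", "Bob,49", "Joe,40"), csv_path)

txt_path <- tempfile(fileext = ".txt")
writeLines(c("Name;Age", "Bob;49", "Joe;40"), txt_path)

# read.csv() assumes a header row and comma separators
people_csv <- read.csv(csv_path)

# read.table() must be told about the header row and the separator
people_txt <- read.table(txt_path, header = TRUE, sep = ";")

str(people_csv)
str(people_txt)
```

Both calls should give a data frame with a character `Name` column and a numeric `Age` column; only the separator differs.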

Loading External Data

The tidyverse has some cleaned-up versions in the readr and readxl packages:

  • read_csv() is for comma-separated data.

  • read_tsv() is for tab-separated data.

  • read_table() is for white-space-separated data.

  • read_delim() is any data with “columns” (you specify the separator). The above are special cases.

  • read_excel() is specifically for dealing with Excel files.

Remember to load the readr and readxl packages first!
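Here is a short sketch of the readr versions, assuming readr 2.0 or later is installed. The temporary files, the `|` delimiter, and the object names are hypothetical; `show_col_types = FALSE` only silences the column-type message.

```r
library(readr)

# Hypothetical data written to temporary files for illustration
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Name,Age", "Bob,49", "Joe,40"), csv_path)

txt_path <- tempfile(fileext = ".txt")
writeLines(c("Name|Age", "Bob|49", "Joe|40"), txt_path)

# read_csv() handles comma-separated data
people_csv <- read_csv(csv_path, show_col_types = FALSE)

# read_delim() is the general case: you specify the separator yourself
people_txt <- read_delim(txt_path, delim = "|", show_col_types = FALSE)

people_csv
people_txt
```

Unlike the base R functions, these return tibbles, which print their column types along with the data.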

Grammar of Graphics

The Grammar of Graphics (GoG) is a principled way of specifying exactly how to create a particular graph from a given data set. It helps us to systematically design new graphs.


Think of a graph or a data visualization as a mapping…

FROM variables in the data set (or statistics computed from the data)…

TO visual attributes (or “aesthetics”) of marks (or “geometric elements”) on the page/screen.

Why Grammar of Graphics?

  • It’s more flexible than a “chart zoo” of named graphs.
  • The software understands the structure of your graph.
  • It easily automates graphing of data subsets.

ggplot2: elegant graphics for data analysis by Hadley Wickham

The grammar makes it easier for you to iteratively update a plot, changing a single feature at a time. The grammar is also useful because it suggests the high-level aspects of a plot that can be changed, giving you a framework to think about graphics, and hopefully shortening the distance from mind to paper. It also encourages the use of graphics customised to a particular problem, rather than relying on specific chart types.

Components of Grammar of Graphics

  • data: dataframe containing variables
  • aes : aesthetic mappings (position, color, symbol, …)
  • geom : geometric element (point, line, bar, box, …)
  • stat : statistical variable transformation (identity, count, linear model, quantile, …)
  • scale : scale transformation (log scale, color mapping, axes tick breaks, …)
  • coord : Cartesian, polar, map projection, …
  • facet : divide into subplots using a categorical variable
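To see how the components fit together, here is one possible plot of the built-in mpg data with each component labeled in a comment. The particular choices (log scale, linear trend, faceting by `drv`) are illustrative assumptions, not a recommended design.

```r
library(ggplot2)

p <- ggplot(data = mpg,                            # data
            mapping = aes(x = displ, y = hwy)) +   # aes: position mappings
  geom_point() +                                   # geom: points (stat = identity)
  geom_smooth(method = "lm", se = FALSE) +         # geom + stat: fitted linear model
  scale_x_log10() +                                # scale: log-transformed x-axis
  coord_cartesian(ylim = c(10, 45)) +              # coord: Cartesian, zoomed in on y
  facet_wrap(~ drv)                                # facet: one panel per drive type

p
```

Removing or swapping any single line changes one component of the graphic while leaving the others alone.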

Using ggplot2

How to Build a Graphic

Complete this template to build a basic graphic:


  • We use + to add layers to a graphic.

This begins a plot that you can add layers to:

ggplot(data = mpg)

ggplot(data = mpg, 
       aes(x = class, y = hwy)
       )

ggplot(data = mpg, 
       aes(x = class, y = hwy)
       ) +
  geom_jitter()

ggplot(data = mpg, 
       aes(x = class, y = hwy)
       ) +
  geom_jitter() +
  geom_boxplot()

How would you make the points be on top of the boxplots?
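One way to answer the question: ggplot2 draws layers in the order they are added, so the geom added last ends up on top. The sketch below swaps the two geoms. Hiding the boxplot outliers with `outlier.shape = NA` is an optional extra so those points are not drawn twice.

```r
library(ggplot2)

p <- ggplot(data = mpg,
            aes(x = class, y = hwy)
            ) +
  geom_boxplot(outlier.shape = NA) +  # drawn first, so it sits underneath
  geom_jitter()                       # drawn second, so points sit on top

p
```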

Aesthetics

We map variables (columns) from the data to aesthetics on the graphic using the aes() function.

What aesthetics can we set (see ggplot2 cheat sheet for more)?

  • x, y
  • color, fill
  • linetype
  • lineend
  • size
  • shape

Special Properties of Aesthetics

Global Aesthetics

ggplot(data = housingsub, 
       mapping = aes(x = date, 
                     y = median)
       ) +
  geom_point()

Local Aesthetics

ggplot(data = housingsub) +
  geom_point(mapping = aes(x = date, 
                           y = median)
             )

Mapping Aesthetics

ggplot(data = housingsub) +
  geom_point(mapping = aes(x = date, 
                           y = median,
                           color = city)
             )

Setting Aesthetics

ggplot(data = housingsub) +
  geom_point(mapping = aes(x = date, 
                           y = median), 
             color = "blue"
               )
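A common pitfall with mapping versus setting is putting a fixed value inside aes(). The sketch below uses the built-in mpg data (rather than housingsub) so it runs on its own; the object names are hypothetical.

```r
library(ggplot2)

# Setting: the color is a fixed value, given outside aes()
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = cty, y = hwy),
             color = "blue")

# Pitfall: inside aes(), "blue" is treated as data (a variable with one value),
# so ggplot2 picks its own default color and adds a legend labeled "blue"
p_pitfall <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = cty, y = hwy, color = "blue"))

p_set
p_pitfall
```

If the points are not the color you asked for and a one-item legend appears, check whether the value ended up inside aes().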

Geometric Objects

We use a geom_xxx() function to represent data points.

one variable

  • geom_density()
  • geom_dotplot()
  • geom_histogram()
  • geom_boxplot()

two variables

  • geom_point()
  • geom_line()
  • geom_density_2d()

three variables

  • geom_contour()
  • geom_raster()

Not an exhaustive list – see ggplot2 cheat sheet.
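As a small illustration of the one-variable geoms, the sketch below draws the same variable (highway mileage from mpg) three ways. The `binwidth` value and object names are arbitrary choices for the example.

```r
library(ggplot2)

# The same aesthetic mapping, reused with three different geoms
base <- ggplot(data = mpg, mapping = aes(x = hwy))

p_hist    <- base + geom_histogram(binwidth = 2)  # counts within bins
p_density <- base + geom_density()                # smoothed distribution
p_box     <- base + geom_boxplot()                # five-number summary

p_hist
p_density
p_box
```

Only the geom changes between the three plots; the data and the mapping stay the same.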

Code
ggplot(data = mpg,
       aes(x = cty,
           y = hwy,
           color = class)
       ) +
  geom_point() +
  labs(x = "City (mpg)", y = "Highway (mpg)") +
  theme(axis.title = element_text(size = 14),
        legend.title = element_blank(),
        legend.text = element_text(size = 14))

Code
ggplot(data = mpg,
       aes(x = cty,
           y = hwy,
           color = class)
       ) +
  geom_text(aes(label = class)) +
  labs(x = "City (mpg)", y = "Highway (mpg)") +
  theme(axis.title = element_text(size = 14),
        legend.title = element_blank(),
        legend.text = element_text(size = 14))

Code
ggplot(data = mpg,
       aes(x = cty,
           y = hwy,
           color = class)
       ) +
  geom_line() +
  labs(x = "City (mpg)", y = "Highway (mpg)") +
  theme(axis.title = element_text(size = 14),
        legend.title = element_blank(),
        legend.text = element_text(size = 14))

Creating a Graphic

To create a specific type of graphic, we will combine aesthetics and geometric objects.


Let’s try it!

Start with the TX housing data.

Make a plot of median house price over time (including both individual data points and a smoothed trend line), distinguishing between different cities.

Code
ggplot(data = txhousing, aes(x = date, y = median, color = city)) + 
  geom_point() + 
  geom_smooth(method = "loess") + 
  labs(x = "Date",
       y = "Median Home Price",
       title = "Texas Housing Prices")

Statistical Transformation: stat

A stat transforms an existing variable into a new variable to plot.

  • identity leaves the data as is.
  • count counts the number of observations.
  • summary allows you to specify a desired transformation function.

Sometimes these statistical transformations happen under the hood when we call a geom.

Statistical Transformation: stat

ggplot(data = mpg,
       mapping = aes(x = class)) +
  geom_bar()

ggplot(data = mpg,
       mapping = aes(x = class)) +
  stat_count(geom = "bar")

ggplot(data = mpg,
       mapping = aes(x = class,
                     y = hwy)) +
  stat_summary(geom = "bar",
               fun = "mean") +
  scale_y_continuous(limits = c(0,45))

ggplot(data = mpg,
       mapping = aes(x = class,
                     y = hwy)) +
  stat_summary(geom = "bar",
               fun = "max") +
  scale_y_continuous(limits = c(0,45))
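To make the identity stat concrete, here is a sketch that counts the classes by hand and then plots the pre-computed counts with geom_col(), which uses stat = "identity". The helper table `class_counts` is introduced only for this example.

```r
library(ggplot2)

# geom_bar() uses stat = "count": it tallies the rows in each class for us
p_bar <- ggplot(data = mpg, mapping = aes(x = class)) +
  geom_bar()

# If the counts already exist, geom_col() plots them as they are
class_counts <- as.data.frame(table(class = mpg$class))
p_col <- ggplot(data = class_counts, mapping = aes(x = class, y = Freq)) +
  geom_col()

p_bar
p_col
```

The two plots should show the same bar heights; the difference is whether ggplot2 or you did the counting.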

Faceting

Extracts subsets of data and places them in side-by-side graphics.

ggplot(data = mpg, aes(x = cty, y = hwy, color = class)) + 
  geom_point() +
  facet_grid(.~class)

ggplot(data = mpg, aes(x = cty, y = hwy, color = class)) + 
  geom_point() +
  facet_wrap(.~class)

  • facet_grid(. ~ b): facet into columns based on b
  • facet_grid(a ~ .): facet into rows based on a
  • facet_grid(a ~ b): facet into both rows and columns
  • facet_wrap( ~ b): wrap facets into a rectangular layout

You can set scales to let axis limits vary across facets:

facet_grid(y ~ x, scales = ______)

  • "free" – both x- and y-axis limits adjust to individual facets
  • "free_x" – only x-axis limits adjust
  • "free_y" – only y-axis limits adjust
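For instance, the sketch below frees only the x-axis of the column facets from earlier, so each class gets its own range of city mileage. The choice of "free_x" is just one of the three options.

```r
library(ggplot2)

p_free <- ggplot(data = mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  facet_grid(. ~ class, scales = "free_x")  # x limits vary by column; y is shared

p_free
```

Free scales make each panel easier to read on its own but make comparisons across panels harder, so use them deliberately.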

You can set a labeller to adjust facet labels:

  • facet_grid(. ~ fl, labeller = label_both)
  • facet_grid(. ~ fl, labeller = label_bquote(cols = alpha ^ .(fl)))
  • facet_grid(. ~ fl, labeller = label_parsed)

Position Adjustments

Position adjustments determine how to arrange geom’s that would otherwise occupy the same space.

  • position = 'dodge': Arrange elements side by side.
  • position = 'fill': Stack elements on top of one another + normalize height.
  • position = 'stack': Stack elements on top of one another.
  • position = 'jitter': Add random noise to X & Y position of each element to avoid overplotting (see geom_jitter()).

Position Adjustments

ggplot(mpg, aes(fl, fill = drv)) + 
  geom_bar(position = "dodge")  # also try "stack" or "fill"
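To compare the bar positions side by side, the sketch below builds the same bar chart three times and changes only the position argument. The object names are hypothetical.

```r
library(ggplot2)

base <- ggplot(mpg, aes(fl, fill = drv))

p_stack <- base + geom_bar(position = "stack")  # default: counts stacked
p_dodge <- base + geom_bar(position = "dodge")  # bars placed side by side
p_fill  <- base + geom_bar(position = "fill")   # stacked, rescaled to proportions

p_stack
p_dodge
p_fill
```

With "fill", every bar reaches a height of 1, so the y-axis shows proportions rather than counts.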

Plot Customizations

Code
ggplot(data = mpg) + 
  geom_jitter(mapping = aes(x = displ, y = hwy, color = cyl)) + 
  labs(x = "Engine Displacement (liters)", 
       y = "Highway MPG", 
       color = "Number of \nCylinders",
       title = "Car Efficiency")

Code
ggplot(data = mpg) + 
  geom_jitter(mapping = aes(x = displ, y = hwy, color = cyl)) + 
  labs(x = "Engine Displacement (liters)", 
       y = "Highway MPG", 
       color = "Number of \nCylinders",
       title = "Car Efficiency") +
  theme_bw() +
  theme(legend.position = "bottom")

Code
ggplot(data = mpg) + 
  geom_jitter(mapping = aes(x = displ, y = hwy, color = cyl)) + 
  labs(x     = "Engine Displacement (liters)",
       color = "Number of \nCylinders",
       title = "Car Efficiency") +
  scale_y_continuous("Highway MPG", 
                     limits = c(0,50),
                     breaks = seq(0,50,5)
                     )

Code
ggplot(data = mpg) + 
  geom_jitter(mapping = aes(x = displ, y = hwy, color = cyl)) + 
  labs(x    = "Engine Displacement (liters)",
       y    = "Highway MPG",
       color = "Number of \nCylinders",
       title = "Car Efficiency") +
  scale_color_gradient(low = "white", high = "green4")

Formatting your Plot Code

It is good practice to put each geom and aes on a new line.

  • This makes code easier to read!
  • Generally: no line of code should be over 80 characters long.

ggplot(data = mpg, mapping = aes(x = cty, y = hwy, color = class)) + geom_point() + theme_bw() + labs(x = "City (mpg)", y = "Highway (mpg)")

ggplot(data = mpg, 
       mapping = aes(x = cty, 
                     y = hwy, 
                     color = class
                     )
       ) + 
  geom_point() + 
  theme_bw() + 
  labs(x = "City (mpg)", 
       y = "Highway (mpg)"
       )

ggplot(data = mpg, 
       mapping = aes(x = cty, y = hwy, color = class)
       ) + 
  geom_point() + 
  theme_bw() + 
  labs(x = "City (mpg)", y = "Highway (mpg)")

Let’s Practice!

How would you make this plot from the diamonds dataset in ggplot2?


  • data
  • aes
  • geom
  • facet

Creating a Game Plan

There are a lot of pieces to put together when creating a good graphic.

  • So, when sitting down to create a plot, you should first create a game plan!

This game plan should include:

  1. What data are you starting from?
  2. What are your x- and y-axes?
  3. What type(s) of geom do you need?
  4. What other aes’s do you need?

Use the mpg dataset to create two side-by-side scatterplots of city MPG vs. highway MPG where the points are colored by the drive type (drv). The two plots should be separated by year.

Code
ggplot(data = mpg,
       mapping = aes(x = cty,
                     y = hwy,
                     color = drv)
       ) +
  geom_point() +
  facet_grid(.~year) +
  labs(x = "city MPG",
       y = "highway MPG")+
  scale_color_discrete(name = "drive type",
                      labels = c("4-wheel","front","rear"))

Graphics

Graphics consist of:

  • Structure: boxplot, scatterplot, etc.

  • Aesthetics: features such as color, shape, and size that map other variables to structural features.

Both the structure and aesthetics should help viewers interpret the information.

What makes bad graphics bad?

  • BAD DATA.
  • Too much “chartjunk” – superfluous details (Tufte).
  • Design choices that are difficult for the human brain to process, including:
    • Colors
    • Orientation
    • Organization

What makes good graphics good?

Edward R. Tufte is a well-known critic of visualizations, and his definition of graphical excellence consists of:

  • communicating complex ideas with clarity, precision, and efficiency.
  • maximizing the data-to-ink ratio.
  • using multivariate displays.
  • telling the truth about the data.

Graphics

When creating graphics, we need to think carefully about how we make structure and aesthetic decisions.

Gestalt Principles

Our brains have an amazing ability to create and perceive structure among visual objects.

  • This is commonly referred to as the Gestalt principles of visual perception.
  • This framework can help us think about how to create the most expressive and effective data visualizations:

Gestalt Principles

Objects with the same visual properties are assumed to be similar and are grouped together.


Use design elements such as shape and color to indicate groupings of the data.

Objects that are close together are perceived as a group.

Since physical distance connotes similarity, grouping bars on a chart can indicate similarities among their data.

Elements that are aligned (on the same line, curve, or plane) are perceived to be more closely related to each other than to other elements.

It is often easier for us to perceive the groupings if the shapes are curves, rather than lines with sharp edges.

Objects that appear to have a boundary around them are perceived as being related.

  • Add line boundaries or shading to group objects.

Objects that are connected, such as by a line, are perceived as a group.

  • Connect data together to indicate a relationship.
  • This connectedness is highly effective and often overrides other principles for group perception.
  • Every line plot is an example of connectedness.

Complex arrangements of visual elements are perceived as a single, recognizable pattern.

Objects are perceived as either standing out prominently in the foreground of an image or receding into the background.

  • Shading or color blocking can be employed to distinguish between the more important figure and less important ground features of an image.
  • Place elements of the most importance in the foreground figure.

Whatever stands out visually is perceived as the most important. It will grab our attention first and hold it for the longest.

  • Use design elements selectively to draw attention to the most important features of the data.

Gestalt Principles

Gestalt Hierarchy    Graphical Feature
1. Enclosure         Facets
2. Connection        Lines
3. Proximity         White Space
4. Similarity        Color/Shape

Implications for practice:

  • Know that we perceive some groups before others.
  • Design to facilitate and emphasize the most important comparisons.

Pre-attentive Features


The next slide will have one point that is not like the others.


Raise your hand when you notice it.

Pre-attentive Features

Pre-attentive features are features that we see and perceive before we even think about it.

  • They will jump out at us in less than 250 ms.

  • E.g., color, form, movement, spatial location.

There is a hierarchy of features:

  • Color is stronger than shape.
  • Combinations of pre-attentive features may not be pre-attentive due to interference.

Double Encoding

No Double Encoding

Color

  • Use color to your advantage!
    • Color, hue, and intensity are pre-attentive: bigger contrasts lead to faster detection.
    • Hue: main color family (red, orange, yellow…)
    • Intensity: amount of color

Color Guidelines

  • Use mappings from data to color that are numerically and perceptually uniform.
  • Be conscious of what certain colors “mean”. Colors are psychological!
  • Do not use rainbow color gradients.
  • Avoid using green-yellow-red color schemes – you might have audience members with color vision deficiency!

Color Guidelines

  • To colorblind-proof a graphic, try:
    • double encoding - when you use color, also use another aesthetic (line type, shape, etc.).
    • using a monochromatic color gradient.
      • If you have a bidirectional scale (e.g., showing + and - values), use purple-white-orange. Always transition through white!
    • printing your chart out in black and white – if you can still read it, it will be safe for colorblind users.
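As a sketch of double encoding, the example below maps drive type to both color and shape in the mpg data. Giving both legends the same title (here the made-up label "Drive type") lets ggplot2 merge them into one.

```r
library(ggplot2)

p_double <- ggplot(data = mpg,
                   mapping = aes(x = cty,
                                 y = hwy,
                                 color = drv,   # first encoding of drive type
                                 shape = drv)   # second encoding of drive type
                   ) +
  geom_point() +
  labs(x = "City (mpg)",
       y = "Highway (mpg)",
       color = "Drive type",
       shape = "Drive type")

p_double
```

A reader who cannot tell the colors apart can still separate the groups by shape.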

Gradients

Usually no more than 7 colors:

Can use colorRampPalette() (from base R’s grDevices package) to produce larger palettes by interpolating existing ones, such as those from the RColorBrewer package

Use color gradient with only one hue for positive values:

Use color gradient with two hues for positive and negative values. Gradient should go through a light, neutral color (white).
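The gradient advice above might look like this in code. The data frame `changes` and its values are hypothetical, and the specific colors are only one reasonable choice.

```r
library(ggplot2)

# colorRampPalette() returns a function; calling it with n gives n interpolated colors
blues <- colorRampPalette(c("white", "darkblue"))(9)
blues

# One hue for values that are all positive
p_seq <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = cty)) +
  geom_point() +
  scale_color_gradient(low = "lightblue", high = "darkblue")

# Two hues through white for positive and negative values (hypothetical data)
changes <- data.frame(group = 1:9, change = seq(-4, 4))
p_div <- ggplot(data = changes, mapping = aes(x = group, y = change, fill = change)) +
  geom_col() +
  scale_fill_gradient2(low = "purple", mid = "white", high = "orange", midpoint = 0)

p_seq
p_div
```

Setting `midpoint = 0` pins the neutral color to zero, so the two hues mean "below" and "above" rather than just "low" and "high".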

Color in ggplot2

There are several packages with color scheme options:

  • RColorBrewer
  • ggsci
  • viridis
  • wesanderson

These packages have color palettes that are aesthetically pleasing and, in many cases, colorblind friendly.

You can also take a look at other ways to find nice color palettes.
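Two of these palette families can be reached through scale functions that ship with ggplot2, so a quick way to try them is sketched below. The palette name "Dark2" is one example of a ColorBrewer palette; the object names are hypothetical.

```r
library(ggplot2)

base <- ggplot(data = mpg, mapping = aes(x = cty, y = hwy, color = drv)) +
  geom_point()

# ColorBrewer palette for a categorical variable
p_brewer <- base + scale_color_brewer(palette = "Dark2")

# Viridis palette for a discrete variable
p_viridis <- base + scale_color_viridis_d()

p_brewer
p_viridis
```

For a continuous variable, the matching functions are scale_color_distiller() and scale_color_viridis_c().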

To do…

  • Lab 2: Exploring Rodents with ggplot2
    • due Monday, 1/22 at 11:59pm (remember, I will not be reachable Monday)

Strike – Week 3

All material will be posted by Friday afternoon. I will be unreachable Monday - Friday.

Please be patient afterwards as I catch up on emails and work.

  • Read Chapter 3: Data Cleaning and Manipulation
    • Check-in 3.1 due Tuesday (1/23) at 8am
  • PA 3: Identify the Mystery College
    • due Thursday (1/25) at 8am
  • Lab 3: XXX
    • due Monday (1/29) at 11:59pm